Creating a linguistic plausibility dataset with non-expert annotators

Authors

  • Benjamin Lambert
  • Rita Singh
  • Bhiksha Raj
Abstract

We describe the creation of a linguistic plausibility dataset that contains annotated examples of language judged to be linguistically plausible, implausible, and everything in between. To create the dataset we randomly generate sentences and have them annotated by crowdsourcing on Amazon Mechanical Turk. Obtaining inter-annotator agreement is a difficult problem because linguistic plausibility is highly subjective. The annotations obtained depend, among other factors, on the manner in which annotators are questioned about the plausibility of sentences. We describe our experiments on posing a number of different questions to the annotators, in order to elicit the responses with greatest agreement, and present several methods for analyzing the resulting responses. The generated dataset and annotations are being made available to the public.
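The abstract mentions analyzing agreement among crowd annotators. A standard statistic for agreement among multiple raters over categorical labels is Fleiss' kappa; the sketch below is illustrative only (the paper does not state which agreement measure it uses), and the label names are hypothetical:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for multi-rater categorical agreement.

    ratings: list of dicts, one per annotated item, mapping a
    category label to the number of annotators who chose it.
    Every item must be rated by the same number of annotators.
    """
    n = sum(ratings[0].values())   # annotators per item
    N = len(ratings)               # number of items
    categories = set().union(*(r.keys() for r in ratings))

    # mean per-item agreement P_bar
    P_bar = sum(
        (sum(c * c for c in r.values()) - n) / (n * (n - 1))
        for r in ratings
    ) / N

    # chance agreement P_e from marginal category proportions
    P_e = sum(
        (sum(r.get(j, 0) for r in ratings) / (N * n)) ** 2
        for j in categories
    )
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical plausibility judgments: 3 annotators per sentence.
judgments = [
    {"plausible": 3},                      # unanimous
    {"plausible": 1, "implausible": 2},    # split
]
print(fleiss_kappa(judgments))  # 0.25
```

Kappa of 1 indicates perfect agreement and 0 indicates chance-level agreement, which makes it a reasonable yardstick for comparing the different question formulations the paper experiments with.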


Similar articles

Complex Linguistic Annotation - No Easy Way Out! A Case from Bangla and Hindi POS Labeling Tasks

Alternative paths to linguistic annotation, such as those utilizing games or exploiting web users, are becoming popular owing to their very high benefit-to-cost ratios. In this paper, however, we report a case study on POS annotation for Bangla and Hindi, where we observe that reliable linguistic annotation requires not only expert annotators, but also a great deal of superv...


Corpus Annotation as a Scientific Task

Annotation studies in CL are generally unscientific: they are mostly not reproducible, make use of too few (and often non-independent) annotators and use guidelines that are often something of a moving target. Additionally, the notion of ‘expert annotators’ invariably means only that the annotators have linguistic training. While this can be acceptable in some special contexts, it is often far ...


PAYMA: A Tagged Corpus of Persian Named Entities

The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...


Addressing the Resource Bottleneck to Create Large-Scale Annotated Texts

Large-scale linguistically annotated resources have become available in recent years. This is partly due to sophisticated automatic and semiautomatic approaches that work well on specific tasks such as part-of-speech tagging. For more complex linguistic phenomena like anaphora resolution there are no tools that result in high-quality annotations without massive user intervention. Annotated corpo...


Using Crowdsourcing to Provide Prosodic Annotations for Non-Native Speech

We present the results of an experiment in which 2 expert and 11 naive annotators provided prosodic annotations for stress and boundary tones on a corpus of spontaneous speech produced by non-native speakers of English. The results show that agreement rates were higher for boundary tones than for stress. In addition, a crowdsourcing approach was implemented to combine the naive annotations to i...



Publication date: 2010